Advances in statistical machine learning encourage language-independent approaches to linguistic technology
development.  Experiments  in  “porting”  technologies  to  handle  new  natural  languages  have  revealed  a  great 
potential  for  multilingual  computing,  but  also  a  frustrating  lack  of  linguistic  resources  for  most  languages. 
Recent  efforts  to  address  the  lack  of  available  resources  have  focused  either  on  intensive  resource  development 
for  a  small  number  of  languages  or  development  of  technologies  for  rapid  porting.  The  Linguistic  Data 
Consortium  recently  participated  in  an  experiment  falling  primarily  under  the  first  approach,  the  surprise 
language  exercise.  This  article  describes  linguistic  resource  creation  within  this  context,  including  the  overall 
methodology  for  surveying  and  collecting  language  resources,  as  well  as  details  of  the  resources  developed 
during  the  exercise.  The  article  concludes  with  discussion  of  a  new  approach  to  solving  the  problem  of  limited 
linguistic resources, one that has recently proven effective in identifying core linguistic resources for less
commonly studied languages.
1.  INTRODUCTION


Recent  applications  of  statistical  machine-learning  algorithms  to  linguistic  technologies 
have  produced  systems  that  are  capable  of  both  learning  and  improving  their 
performance  when  exposed  to  sufficient  quantities  of  appropriately  labeled  training  data. 
Experiments  in  porting  such  technologies  have  revealed  both  the  general  potential  for 
intensively  multilingual  computing  and  the  specific  cases  in  which  simplifying 
assumptions and implementation decisions block true generality [Psutka et al. 2003;
Byrne et al. 1999]. It has become clear, however, that the major impediment to creating
linguistic technologies in more than a handful of the most common languages is the
dearth  of  training  data  [Fumi  2001;  Kirchoff  et  al.  2002]. 

Attempts  to  address  this  lack  of  available  resources  have  taken  one  of  two 
approaches:  (1)  intensive  effort  on  a  small  number  of  new  languages  [Cieri  and 
Liberman  2002]  and  (2)  development  of  technologies  that  may  be  rapidly  ported  to  new 
languages  [Al-Onaizan  et  al.  1999].  In  the  sections  that  follow  we  describe  recent  and 
ongoing  work  at  the  University  of  Pennsylvania's  Linguistic  Data  Consortium  as  part  of 
the  surprise  language  exercise,  an  experiment  in  rapid  linguistic  resource  and 
technology  development  largely  falling  under  the  first  approach.  We  conclude  with  a 
discussion  of  yet  a  third  approach  to  the  problem  of  resource  scarcity,  motivated  in  part 
by  our  experiences  in  the  surprise  language  exercise.  We  describe  our  evolving 
methodology  and  outline  a  plan  of  action,  already  in  its  beginning  stages,  that  promises 
to  provide  core  resources  for  a  large  number  of  critical  languages.  We  intend  this  article 
as  a  call  for  international  collaboration  among  resource  providers  and  technology 
developers  to  resolve  the  language  resource  availability  problem. 

2.  DEFINITIONS 

One  should  note  at  the  outset  that  the  terms  "core  language  resources"  and  "critical  lesser- 
studied  languages"  are  variably  defined  among  the  scholars  who  use  them. 

Consider first the issue of critical but lesser-studied languages. According to the
Ethnologue [Grimes 2003], there are nearly 7000 languages spoken in the world today.
The thought of creating resources for all of them boggles the imagination and discourages
further discussion or planning. Here we propose to focus on a manageable subset, those
that  are  the  native  languages  of  at  least  one  million  people.  This  reduces  our  scope  to 
some  300  languages.  Nearly  80%  of  the  world’s  inhabitants  speak  one  of  these  languages 
natively. Figure 1 plots the cumulative distribution of the world’s inhabitants by their
native languages. This graph shows that most of the world’s inhabitants are native
speakers  of  the  320  most  common  languages.  It  is  clear  that  creating  resources  for  this  set 
of  languages  provides  the  biggest  benefit  for  the  effort. 
Fig. 1. Cumulative distribution of the world’s inhabitants (y-axis) by native language (x-axis). The
320  most  common  native  languages  cover  80%  of  the  world’s  inhabitants. 
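
The coverage figure quoted above can be reproduced from any table of native-speaker counts. The following is a minimal sketch only; the function name and the toy counts are assumptions made for illustration and are not Ethnologue data.

```python
from itertools import accumulate

def languages_needed(speaker_counts, world_population, target_share):
    """Return how many of the largest languages are needed to reach target_share.

    speaker_counts: native-speaker counts, one per language, in any order.
    Returns None if the listed languages never reach the target share.
    """
    ranked = sorted(speaker_counts, reverse=True)
    for rank, covered in enumerate(accumulate(ranked), start=1):
        if covered / world_population >= target_share:
            return rank
    return None

# Toy figures (in millions), invented purely for illustration.
toy_counts = [400, 300, 150, 100, 30, 20]
print(languages_needed(toy_counts, 1000, 0.8))
```

Applied to real speaker counts, the same loop yields the point on the x-axis of Figure 1 at which the curve crosses a chosen share of the world's inhabitants.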

As  for  core  language  resources,  we  define  these  as  the  resources  necessary  for 
translingual  information  access  technologies.  Such  resources  include  texts,  parallel  texts, 
translation  lexicons,  entity  databases  plus  a  range  of  manual  annotations  designed  to 
provide  training  material  as  well  as  benchmark  test  data.  Clearly,  this  list  provides  for 
only  a  subset  of  the  desirable  technologies;  however,  we  believe  that  these  are  the  critical 
resources  for  translingual  information  access,  as  well  as  being  those  resources  that  are 
currently  within  our  grasp. 


3.  THE  SURPRISE  LANGUAGE  EXERCISE 

The  Linguistic  Data  Consortium  recently  participated,  along  with  several  other  research 
sites, in an experiment known as the surprise language exercise. The exercise challenged
sites to identify or create linguistic resources and develop working technology for a
previously  untargeted  language  within  a  constrained  time  span. 

The  exercise  was  part  of  the  DARPA  program  in  Translingual  Information  Detection, 
Extraction  and  Summarization  (TIDES),  which  requires  computer-readable  resources 
sufficient  to  support  translingual  information  processing  tasks  [Wayne  2002].  Although 
TIDES  had  adopted  an  early  focus  on  common  languages  such  as  English,  Chinese,  and 
Arabic  where  ready  availability  of  data  would  allow  research  to  continue  relatively 
unfettered,  the  porting  of  TIDES  technologies  to  less  common  languages  has  always  been 
a  desideratum  of  the  program.  For  the  primary  focus  languages,  TIDES  has  already 
produced  a  very  rich  set  of  resources,  described  elsewhere  [Cieri  and  Liberman  2002; 
LDC  2003].  In  2003,  the  program  began  to  address  the  need  for  technologies  in  less 
common  languages  through  experiments  in  rapid  technology  porting  where  data 
collection,  resource  creation,  and  technology  development  take  place  simultaneously 
within  a  very  short  time  period  (i.e.,  one  month). 

During  the  surprise  language  exercise  described  below,  LDC's  primary  role  was  to 
coordinate  development  and  dissemination  of  linguistic  resources  for  the  target  language. 
Once  the  desired  resources  had  been  obtained  or  created,  technology  sites  put  them  to  use 
in  developing  NLP  tools  for  statistically-based  machine  translation,  topic  detection  and 
tracking,  cross-lingual  information  retrieval,  information  extraction,  and  summarization. 
While  evaluation  of  this  work  is  ongoing,  preliminary  results  are  promising. 

4.  PREPARATION:  A  LANGUAGE  RESOURCES  SURVEY 

In  preparation  for  the  surprise  language  exercise,  LDC  staff  designed  and  began  to 
implement  a  survey  of  language  resources  for  the  320  most  common  languages.  A 
complete  description  of  the  survey  would  take  us  beyond  the  scope  of  the  present  article, 
but the survey questions explore the structural features of a language, the demographic
features of its speakers and the availability of linguistic resources, digital or otherwise, to
support  technology  development.  A  linguist  completes  the  questions  of  the  survey  in  an 
order  that  allows  quick  scoring  of  languages  according  to  their  compatibility  with  the 
kinds  of  technology  we  hope  to  support.  For  example,  one  of  the  first  questions  is 
whether  the  language  is  written.  There  are  several  languages  with  more  than  one  million 
speakers  but  which  have  no  tradition  of  literacy,  making  them  impractical  targets  for 
technologies  that  rely  on  large  volumes  of  written  material.  LDC  has  completed  the 
survey  in  part  or  wholly  for  over  150  languages,  and  plans  to  continue  the  survey  for  the 
remainder  of  the  320  as  time  and  funding  allow. 

Figure  2  shows  a  sample  of  a  summary  report  produced  from  the  full  survey.  The 
categories  across  the  top  of  the  spreadsheet  show  some  of  the  items  covered  by  the 
survey  that  were  of  special  importance  for  the  choice  of  surprise  language,  including  the 
main  country  where  the  language  is  spoken,  number  of  native  speakers,  whether  the 
language  is  written,  whether  the  survey  found  news  text  and  other  resources  in  electronic 
form,  whether  the  language  has  a  “complex”  morphology,  and  so  on.  The  final  column 
displays  a  numeric  summary  of  the  “true”  and  “false”  answers  to  each  question  (and  in 
some  cases,  a  “questionable”  answer);  this  is  used  to  sort  the  languages  by  candidate 
status. 
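
A numeric summary of this kind might be computed along the following lines. This is a sketch under stated assumptions: the weights given to "true", "questionable", and "false" answers, the question names, and the language labels are all hypothetical, since the actual scoring rule used in the survey is not specified here.

```python
# Hypothetical weights: the survey's real scoring rule is not given in the text.
ANSWER_SCORE = {"true": 1.0, "questionable": 0.5, "false": 0.0}

def candidate_score(answers):
    """Sum the weights of one language's answers (question name -> answer)."""
    return sum(ANSWER_SCORE[answer] for answer in answers.values())

def rank_candidates(survey):
    """Return language names sorted from most to least promising candidate."""
    return sorted(survey, key=lambda lang: candidate_score(survey[lang]), reverse=True)

# Invented survey rows; question names are placeholders.
toy_survey = {
    "Language A": {"written": "true", "news_text": "false", "lexicon": "questionable"},
    "Language B": {"written": "true", "news_text": "true", "lexicon": "true"},
    "Language C": {"written": "false", "news_text": "false", "lexicon": "false"},
}
print(rank_candidates(toy_survey))
```

Questions are assumed to be phrased so that "true" always favors candidacy; a question such as complex morphology would need its weight inverted.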

The  actual  survey  report  includes  details  for  each  of  the  categories  displayed  in  the 
summary  report,  as  well  as  for  a  number  of  other  categories.  For  example,  if  the  answer 
to (electronic) “News text” is true, the detailed survey report would list URLs for the
news  websites  or  other  sources  of  news  text  that  we  had  identified. 

We  believe  that  the  results  of  this  survey  will  be  of  interest  to  a  wide  variety  of  users; 
moreover,  others  will  be  able  to  fill  in  gaps  in  our  knowledge  to  further  enrich  the  survey. 
While our intention is to eventually publish the survey, we have not yet determined the
manner in which it will be made available.

Fig. 2. Sample from survey of language resources.


Although  work  on  the  survey  began  before  the  one  month  allocated  for  the  surprise 
language  exercise,  we  argued  that  such  a  head  start  was  in  fact  appropriate  because  (1)  it 
was  conducted  in  a  general  way  without  knowing  which  language  would  be  the  specific 
target  of  the  experiment;  (2)  it  allowed  TIDES  sponsors  to  select  from  the  set  of 
languages  where  rapid  porting  was  an  actual  possibility;  and  (3)  it  changed  the  terrain 
both  fundamentally  and  permanently  for  those  who  would  port  linguistic  technologies  to 
less  common  languages.  In  other  words,  once  the  survey  was  made  available,  the  task  of 
rapid  porting  became  easier  for  a  large  number  of  languages.  It  was  this  realization,  along 
with  the  success  of  the  experiment  described  in  the  section  below,  that  impels  us  toward 
the  more  ambitious  proposal  presented  in  the  article's  final  section. 


5.  THE  DRY-RUN:  AN  EXPERIMENT  IN  RESOURCE  DISCOVERY 


In  March  2003,  a  surprise  language  dry-run  was  organized  by  LDC  to  assess  the 
feasibility  of  the  full-scale  experiment  and  to  answer  basic  questions  about  how  best  to 
administer a large-scale, collaborative, rapid resource- and technology-development
exercise.  Neither  LDC  nor  any  other  participating  sites  knew  in  advance  what  language 
would  be  selected  by  TIDES  sponsors  for  the  dry-run.  On  March  5,  participants  were 
notified  that  the  target  was  Cebuano,  a  language  of  the  Philippines.  Prior  searches  for 
computer-readable  data  on  this  language  had  turned  up  only  a  bible  and  one  small  news 
text  archive.  (As  it  turned  out,  this  news  archive  contained  fewer  than  10,000  Cebuano 
words.)  In  addition,  we  knew  of  several  printed  dictionaries  and  grammars. 

Within  eight  hours  of  the  beginning  of  the  exercise,  a  team  of  eight  linguists  and 
programmers  at  LDC  had  discovered  250,000  words  of  news  texts  in  Cebuano,  several 
other  small  monolingual  and  bilingual  Cebuano  texts,  and  no  fewer  than  four  computer- 
readable  lexicons,  one  of  which  turned  out  to  have  on  the  order  of  24,000  entries 
(lexemes).  Other  sites  working  on  the  exercise  identified  resources  as  well.  There  was  a 
good  deal  of  overlap  among  what  different  sites  discovered,  giving  us  some  confidence 
that  we  had  located  a  reasonably  complete  set  of  resources. 

The  disparity  between  what  we  were  able  to  find  before  versus  during  the  exercise  is 
attributable  in  part  to  the  greater  effort  during  the  exercise:  a  few  person-hours  before, 
eight  hours  times  eight  participants  during.  But  perhaps  more  important  was  the  search 
methodology.  Prior  to  the  exercise,  we  had  done  searches  for  the  word  “Cebuano”  in 
combination  with  other  English  words,  such  as  “lexicon,”  “dictionary,”  or  “news.”  But 
this  missed  some  resources  that  were  labeled  with  alternative  names  for  Cebuano 
(Bisayan  and  Visayan),  as  well  as  resources  that  were  not  labeled  as  dictionaries,  etc. 

During  the  exercise,  we  employed  a  different  method,  suggested  by  Mark  Liberman 
(see  also  Ghani  et  al.  [2001]).  Once  we  had  found  a  handful  of  pages  in  Cebuano,  we  did 
a count of word forms. We fed two of the most common word forms (the words for “this”
and “that”) back into search engines, and this quickly led to more discoveries. This
technique,  using  these  and  other  common  words,  led  to  most  of  our  discoveries.  We  also 
used  lists  of  words  and  tetragrams  as  queries  to  electronic  lexicons  of  Cebuano  to 
determine  the  extent  of  their  coverage. 
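
The word-form count that produces seed terms can be sketched as follows. The function name, the tokenization rule, and the placeholder texts are assumptions for illustration; the code does not reproduce the scripts actually used during the dry-run.

```python
import re
from collections import Counter

def seed_terms(pages, n=2, min_length=1):
    """Return the n most frequent word forms across a handful of texts.

    Tokens are maximal runs of letters, lowercased; min_length can be raised
    to exclude very short forms.
    """
    counts = Counter()
    for text in pages:
        words = re.findall(r"[^\W\d_]+", text.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return [word for word, _ in counts.most_common(n)]

# Placeholder tokens stand in for target-language text.
toy_pages = ["alpha beta alpha", "alpha gamma beta"]
print(seed_terms(toy_pages, n=2))
```

The returned forms would then be submitted to a search engine as queries, and newly found pages added to the pool for another round of counting.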

Since  the  Cebuano  dry-run,  we  have  experimented  with  the  technique  of  searching  for 
resources  using  seed  words  in  new  languages.  Preliminary  experiments  have  been 
promising,  even  with  languages  having  extensive  inflectional  morphology.  For  instance, 
Tzeltal  (a  Mayan  language  of  Mexico),  Swahili  (east  Africa),  and  Shuar  (a  Jivaroan 
language  of  Ecuador)  all  have  substantial  inflectional  morphology,  including  both 
prefixes  and  suffixes.  Nevertheless,  searches  with  a  few  common  nouns  return  numerous 
hits,  most  of  which  are  indeed  texts  in  the  target  languages. 

Seed  terms  can  be  extracted  from  an  initial  set  of  texts,  as  we  did  with  Cebuano  and 
Swahili;  or  they  can  come  from  dictionaries,  as  was  the  case  for  Tzeltal  and  Shuar.  In  the 
case of inflectionally rich languages, when the word forms are extracted from texts, they
will  obviously  be  inflected.  Dictionary  citation  forms  for  most  languages  are  normally 
inflected  forms  as  well,  although  dictionary  writers  commonly  strive  to  use  the  least 
inflected  form  of  a  word.  But  for  some  languages,  dictionaries  traditionally  use  citation 
forms  that  are  bare  roots  which  can  never  appear  in  texts  in  that  form.  Using  this  search 
technique  with  such  languages  would  require  consulting  a  grammar  to  create  inflected 
forms  to  be  fed  into  search  engines. 

Word  length  is  also  a  consideration.  Words  that  are  just  two  or  three  characters  long 
may  turn  up  too  frequently  in  other  languages  to  be  of  use.  Where  there  are  closely 
related  languages,  even  longer  words  may  give  rise  to  spurious  hits  in  the  related 
languages. Ideally, we would have lexicons of 100 or more languages, and use these
lexicons to eliminate candidate search terms that appear in other languages. (Small
lexicons would probably serve this purpose better than large lexicons, since homographs
across languages are only problematic if they are common in the other languages.)
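
Both filters, by length and by presence in other languages' lexicons, can be expressed in a few lines. This is a sketch only: the minimum length of four characters and the toy lexicons are assumptions, and no such multi-language lexicon collection was available during the exercise.

```python
def filter_seed_terms(candidates, other_lexicons, min_length=4):
    """Keep candidates that are long enough and absent from other languages.

    other_lexicons maps a language name to a set of its (common) word forms.
    """
    foreign = set().union(*other_lexicons.values())
    return [t for t in candidates if len(t) >= min_length and t not in foreign]

# Invented word forms; none are claimed to be real lexicon entries.
toy_candidates = ["sa", "kana", "balay", "mano"]
toy_lexicons = {"Language X": {"mano", "casa"}, "Language Y": {"kana"}}
print(filter_seed_terms(toy_candidates, toy_lexicons))
```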

Another  issue  is  the  fact  that  for  minority  languages,  the  writing  system  may  have 
undergone  recent  changes.  An  older  dictionary  may  contain  words  that  are  now  spelled 
differently,  and  its  use  will  therefore  result  in  a  lack  of  hits.  This  was  true  of  some  printed 
dictionaries  of  Cebuano;  fortunately,  the  spelling  changes  required  were  mechanical. 

Encodings  present  additional  issues.  Typing  in  search  terms  in  an  encoding  requires 
choosing  a  keyboard  and  encoding,  and  some  encodings  may  not  be  supported.  Much 
simpler  is  copying  and  pasting  seed  terms  from  a  web  page  in  the  appropriate  encoding. 

However,  some  languages  use  multiple  encodings.  For  instance,  several  languages  of 
eastern  Europe  have  more  than  one  commonly  used  encoding.  A  worst  case  is  Amharic 
(of  Ethiopia),  with  over  70  encodings  (see  the  LibEth  project  at 
http://libeth.sourceforge.net), several of which are commonly used on web pages. In
order  to  get  a  sufficiently  broadly  based  corpus,  it  may  therefore  be  necessary  to  enter 
search  terms  in  multiple  encodings. 
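
Producing one query per encoding is straightforward when the encodings are standard ones. The sketch below uses codecs shipped with Python for an eastern European example; the function name is an assumption, and legacy font-specific encodings such as those used for Amharic would need custom mapping tables instead.

```python
def encode_variants(term, encodings):
    """Return {encoding name: byte string} for every encoding that can represent term."""
    variants = {}
    for name in encodings:
        try:
            variants[name] = term.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding has no code for some character in the term
    return variants

# "świat" is Polish for "world"; its first letter is coded differently
# in ISO 8859-2 and in Windows code page 1250.
variants = encode_variants("świat", ["utf-8", "iso8859_2", "cp1250", "ascii"])
for name, raw in variants.items():
    print(name, raw.hex())
```

Each byte string corresponds to a distinct query, so a page stored in one encoding is found only by the matching variant.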

One  additional  issue  we  faced  during  the  Cebuano  dry-run  was  that  of  language 
identification.  Cebuano  is  related  to  a  number  of  other  Philippine  languages  (and  more 
distantly  to  other  Malayo-Polynesian  languages),  and  it  can  therefore  be  difficult  for 
nonspeakers  to  tell  whether  texts  are  actually  in  Cebuano.  We  addressed  this  problem  by 
looking  up  a  variety  of  words  from  the  texts  in  question  in  both  printed  and  computer- 
readable  Cebuano  dictionaries.  This  only  works  for  words  with  no  inflectional  affixes,  so 
the  recall  of  this  method  is  limited  by  inflectional  morphology.  Cebuano  has  minimal 
morphology  (verbs  are  inflected  with  prefixes,  infixes,  and  suffixes,  but  nouns  are  for  the 
most  part  uninflected);  hence  we  were  reasonably  certain  that  our  texts  were  in  Cebuano. 
Lingering  doubts  having  to  do  with  languages  that  are  very  closely  related  to  Cebuano 
were  removed  by  having  Cebuano  speakers  check  the  texts. 
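
The dictionary-lookup check can be approximated automatically by measuring what fraction of a text's word forms appear in a lexicon. This is a minimal sketch; the function name, the tokenization rule, and the placeholder data are assumptions, and any acceptance threshold would have to be chosen empirically.

```python
import re

def lexicon_hit_rate(text, lexicon):
    """Return the fraction of word tokens in text that are found in lexicon."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in lexicon for token in tokens) / len(tokens)

# Placeholder word forms stand in for dictionary headwords.
toy_lexicon = {"aa", "bb", "cc"}
print(lexicon_hit_rate("aa bb cc dd", toy_lexicon))
```

Because only uninflected forms match, a low rate does not by itself rule out the target language, which is why speaker verification remained necessary.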


In  sum,  during  this  short  exercise,  we  were  able  to  locate  a  surprising  number  of 
resources  in  a  short  period  of  time,  giving  us  confidence  that  for  a  full-scale  exercise  we 
would  be  able  to  find  sufficient  resources  in  any  language  that  appeared  practical  based 
on  our  preliminary  survey. 

6.  THE  TEST:  THE  SURPRISE  LANGUAGE  EXERCISE 

A  few  months  after  the  work  described  above,  the  full  surprise  language  exercise  took 
place.  Unlike  the  dry-run,  which  was  designed  primarily  to  evaluate  the  process  for 
rapidly  locating  and  disseminating  linguistic  resources,  the  full  exercise  would  put  those 
resources  to  use  in  the  development  of  various  natural  language  processing  technologies. 
On  June  2,  2003,  participants  learned  that  the  surprise  language  was  Hindi.  Both  the 
process  and  the  results  of  the  full  exercise  were  significantly  different  from  the  dry-run. 
For  one,  the  amount  of  text  written  in  Hindi  available  on  the  web  is  orders  of  magnitude 
greater  than  for  Cebuano.  Thus  we  were  not  faced  with  the  difficulty  of  finding  Hindi 
text,  but  rather  with  processing  vast  quantities  of  it. 

A  second  difference — and  one  which  loomed  ever  larger  in  our  minds  during  the 
course  of  the  experiment — arose  from  the  fact  that  while  Cebuano  is  written  with  a  Latin 
alphabet,  and  can  therefore  be  encoded  with  the  ASCII  character  set,  Hindi  uses  the 
Devanagari  writing  system.  Indian  computer  scientists  have  therefore  developed  an  8-bit 
character  encoding  known  as  ISCII  (see  http://brahmi.sourceforge.net/docs/iscii91.pdf  for 
a  draft  standard),  which  reportedly  forms  the  basis  of  the  Unicode  implementation  for 
Hindi.  In  addition,  there  are  several  different  Romanization  standards. 

Unfortunately,  we  encountered  no  web  site  from  which  to  harvest  text  that  actually 
used  ISCII,  and  neither  of  the  news  sites  that  used  Unicode  (Voice  of  America  and  the 
BBC)  was  based  in  India.  Instead,  virtually  every  Hindi  news  site  had  its  own  more  or 
less proprietary 8-bit font, and each font used its own unique encoding. Indeed, several
web sites used more than one font and/or encoding; the Indian parliament's site requires
downloading  five  different  fonts,  although  some  of  these  appear  to  use  the  same 
encoding. 

In  order  to  develop  NLP  tools  that  would  work  with  text  from  different  web  sites,  we 
were  forced  to  convert  all  the  text  to  a  standard  encoding,  for  which  we  chose  Unicode 
(UTF-8). 

Written  Hindi  has  around  50  consonants  and  vowels,  with  no  upper/lower  case 
distinctions.  This  would  easily  fit  into  a  7-bit  character  set  (or  into  the  upper  128  code 
points  of  an  8-bit  character  set).  However,  there  are  variant  forms  of  many  consonants 
used  when  these  appear  in  consonant  clusters,  as  well  as  variant  forms  of  vowels.  The 
ISCII  character  set  assumes  intelligent  font  rendering,  so  that  only  a  single  form  of  each 
consonant  or  vowel  needs  to  be  encoded.  But  most  designers  of  proprietary  fonts  have 
encoded  variant  character  forms,  electing  instead  to  put  the  intelligence  into  keyboard 
drivers. The result is that not only are the code points different for each font, but the set of
characters that is actually encoded is to some extent different, rendering the
encoding conversion process nontrivial.
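
One common way to handle fonts that encode variant and conjunct forms is a per-font mapping table applied with longest-match-first lookup. The sketch below is illustrative only: the table is invented and does not describe any real Hindi font, the function name is an assumption, and a real converter would also need to reorder vowel signs that are stored in visual rather than logical order.

```python
def convert_font_text(text, table):
    """Convert font-encoded text to Unicode using longest-match table lookup.

    Returns (converted text, list of source characters with no mapping).
    """
    keys = sorted(table, key=len, reverse=True)  # try longer sequences first
    output, unmapped = [], []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                output.append(table[key])
                i += len(key)
                break
        else:
            unmapped.append(text[i])  # keep the character so errors stay visible
            output.append(text[i])
            i += 1
    return "".join(output), unmapped

# Invented table: "k" stands for KA, and "kS" for the conjunct KA + VIRAMA + SSA.
toy_table = {"k": "\u0915", "kS": "\u0915\u094d\u0937"}
print(convert_font_text("kSk?", toy_table))
```

Collecting the unmapped characters gives a quick inventory of what a given table still fails to cover.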

An  analogy  to  this  problem  in  Roman  character  encodings  is  accented  characters, 
which  under  some  conventions  are  treated  as  unitary  characters,  while  other  conventions 
treat  them  as  a  base  character  plus  diacritical  marks.  Choices  between  unitary  and 
multigraph  representation  are  prevalent  in  Hindi,  and  different  alternatives  are  commonly 
used  in  different  encodings. 
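
Within Unicode itself, the unitary versus multigraph choice is reconciled by normalization. The following sketch uses Python's standard unicodedata module; the helper name is an assumption, and normalization addresses only canonically equivalent sequences, not the font-specific glyph variants discussed above.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after canonical (NFC) normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed_e = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed_e = "e\u0301"        # e followed by COMBINING ACUTE ACCENT
precomposed_qa = "\u0958"       # DEVANAGARI LETTER QA, one code point
decomposed_qa = "\u0915\u093c"  # KA followed by DEVANAGARI SIGN NUKTA

# Raw comparison fails even though each pair displays identically.
print(precomposed_e == decomposed_e, same_text(precomposed_e, decomposed_e))
print(precomposed_qa == decomposed_qa, same_text(precomposed_qa, decomposed_qa))
```

Normalizing both the corpus and the lexicon to the same form before lookup avoids mismatches between strings that look identical on screen.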

Likewise,  some  English  typesetting  conventions  provide  special  treatment  for  certain 
character  sequences,  such  as  “fi”:  the  shape  of  the  individual  characters  may  be  slightly 
modified  in  such  ligatures.  Ligature-like  characters  are  abundant  in  Hindi,  and  encodings 
often  provide  hard-coded  ligature  forms;  again,  the  decisions  as  to  which  ligature  forms 
to  hard-code  are  often  made  differently  for  different  Hindi  encodings. 


Debugging  character-set  conversion  also  proved  difficult.  We  often  found  that  a 
converter  that  appeared  to  work  on  a  small  text  sample  failed  to  completely  convert  larger 
texts,  leaving  the  result  peppered  with  errors,  both  visible  and  covert.  (By  “covert”  errors 
we  mean  errors  that,  while  not  visible  in  displayed  text,  result  in  differences  in  the 
underlying  sequence  of  code  points,  and  would  therefore  affect,  e.g.,  dictionary  lookup.) 
Some  of  these  errors  were  due  to  bugs  in  our  rapidly  developed  converter,  while  others 
were  due  to  nonstandard  characters  or  character  sequences  in  the  text  (e.g.,  in  loan 
words),  or  simply  typos  (which  were  surprisingly  frequent  in  some  texts). 
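
Covert errors of this kind can be surfaced by scanning converted text for code points that fall outside the expected ranges. This is a sketch under stated assumptions: the function name is hypothetical, and the rule that only Devanagari-block and ASCII characters are expected is a simplification, since some flagged characters (for example zero-width joiners) can be legitimate in Hindi text.

```python
import unicodedata

DEVANAGARI_BLOCK = range(0x0900, 0x0980)

def suspicious_characters(text):
    """List (index, code point, name) for characters outside Devanagari and ASCII."""
    flagged = []
    for index, char in enumerate(text):
        code_point = ord(char)
        if code_point in DEVANAGARI_BLOCK or code_point < 0x80:
            continue
        flagged.append((index, f"U+{code_point:04X}", unicodedata.name(char, "UNKNOWN")))
    return flagged

# The six code points below spell the word "Hindi" in Devanagari;
# a zero-width space is appended to simulate an invisible error.
hindi_word = "\u0939\u093f\u0928\u094d\u0926\u0940"
print(suspicious_characters(hindi_word + "\u200b"))
```

A report of this form lets a reviewer inspect invisible characters without relying on how the text is rendered.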

In  sum,  character-set  conversion  turned  out  to  be  a  much  greater  problem  than  we  had 
anticipated.  Despite  these  difficulties,  a  significant  number  of  resources  were  identified, 
converted,  and  further  processed  by  LDC  with  much  help  from  the  other  sites 
participating  in  the  exercise. 

While  the  resource  discovery  period  for  Cebuano  lasted  just  a  few  hours,  the  same 
process  extended  into  the  final  days  of  the  Hindi  exercise.  This  heavily  collaborative 
effort  garnered  at  least  13  lexicons  (both  general  and  domain-specific),  resulting  in  nearly 
30,000  unique  lexical  entries;  17  sources  of  monolingual  text;  over  30  sources  of 
bilingual  text,  including  the  bible  and  other  literature  but  also  several  news  and 
government  websites  plus  several  bilingual  corporate  and  technology  websites. 
Participants  also  found  a  general  morphological  parser  and  a  number  of  entity  lists, 
including  telephone  directories,  a  geographical  place  name  list,  and  government  voter  and 
personnel  lists. 

During  the  Cebuano  exercise,  we  had  experimented  with  the  use  of  a  “wiki”  (publicly 
editable)  web  site  for  purposes  of  resource  dissemination,  and  with  a  blog  for  rapid 
communication,  but  neither  technique  seemed  to  work  terribly  well.  In  particular,  we 
encountered  problems  with  conflicting  multiple  edits,  presumably  caused  by  the  very 
rapid posting and the need for continual updating of information. As a result, it became
necessary for each site to have its own web page on the wiki, thereby eliminating the
supposed advantage of collaborative editing. Moreover, with multiple pages on the wiki,
a  visitor  in  search  of  a  particular  resource  had  to  consult  numerous  pages,  making  the 
sharing  of  resources  somewhat  cumbersome. 

Accordingly,  for  the  Hindi  exercise  we  chose  a  centrally  located  and  edited  repository 
for  all  shared  data,  which  worked  in  the  following  way. 

When  a  site  identified  a  resource,  it  was  announced  through  an  email  listserv  to  all 
participants;  the  emails  could  also  be  accessed  from  a  web  archive.  URLs  of  all  “found” 
resources  were  posted  on  a  web  page  whose  URL  was  made  available  to  all  participants. 

Particularly  interesting  items  on  the  found  resources  list  were  then  selected  by  one  of 
the  participating  sites  for  download  and  further  processing.  In  the  case  of  text  resources, 
processing  might  involve  identifying  encoding,  transliteration  into  a  standard  encoding 
(in  the  case  of  Hindi),  stripping  HTML  tags,  and  tokenization.  In  some  cases  multiple 
versions  of  resources  might  be  provided,  e.g.,  a  version  of  a  text  with  improved  encoding 
conversion,  or  a  lexicon  consisting  of  merged  lexical  entries  from  a  number  of  found 
lexicons. 
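
The tag-stripping and tokenization steps can be sketched with Python's standard library. The class and function names are assumptions, tokenization is by whitespace only, and the code does not represent the tools the participating sites actually used; encoding identification and conversion are assumed to have happened beforehand.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content while ignoring tags, scripts, and style sheets."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_tokens(html):
    """Strip markup from an HTML string and split the text on whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).split()

sample = "<html><style>p{}</style><p>one <b>two</b> three</p></html>"
print(html_to_tokens(sample))
```

Joining text fragments with a space keeps words in adjacent elements apart, at the cost of splitting a word whose interior is wrapped in a tag.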

In  addition  to  processing  found  resources,  some  sites  created  new  resources  from 
scratch,  such  as  morphological  stemmers  or  encoding  converters. 

Both  processed  and  created  resources  were  treated  in  the  same  way.  A  site  wishing  to 
provide  such  a  resource  had  two  choices:  it  could  either  announce  it  in  the  email  list  and 
say that the file would be distributed by LDC, or it could announce it but make the file
available at its own web site. The latter was a faster way to make available especially
important resources, as LDC sometimes found itself a couple of days behind in processing
the  resources  found  during  the  Hindi  exercise;  but  it  had  the  disadvantage  of  being  more 
difficult for potential users to find and download. In either case, the file was (eventually)
made available from a second, password-protected web page at LDC, providing a central
download  site. 

The  process  of  making  submitted  resources  available  from  the  LDC  website  was 
essentially  a  matter  of  validating  the  resource,  ensuring  that  its  contents  and  format  were 
documented,  and  entering  the  information  into  a  database  system.  This  resulted  in 
something  of  a  bottleneck,  with  one  or  two  individuals  at  LDC  (who  were  also 
performing  many  other  critical  surprise  language  tasks)  logging  each  of  the  submitted 
resources.  This  difficulty  might  have  been  avoided  by  allowing  remote  sites  to  upload 
the  resource  files  and  enter  the  metadata  into  the  database  themselves,  but  at  some  cost  to 
consistency.  A  simpler  solution  might  have  been  to  have  someone  at  LDC  whose  sole 
task  was  resource  validation  and  logging. 

The  reason  for  password  protection  on  the  processed  resources  was  that  much  of  the 
raw  data  harvested  for  the  surprise  language  exercise  was  drawn  from  commercial  data 
providers.  While  use  of  this  data  for  the  purposes  of  the  exercise  itself  can  be  seen  as 
falling  within  the  context  of  fair  use,  this  limits  access  to  the  data  to  those  TIDES  sites 
participating  in  the  exercise.  LDC  is  currently  pursuing  intellectual  property  rights 
negotiations  with  data  providers  in  order  to  secure  distribution  rights,  so  that  much  of  the 
data developed during the surprise language exercise can eventually be made available to a wider
community  of  linguistic  researchers,  educators,  and  technology  developers. 

As  mentioned  earlier,  we  also  used  teleconferencing  to  work  through  issues:  daily  at 
first,  then  two  or  three  times  weekly.  Teleconferencing  turned  out  to  be  a  highly  effective 
supplement  to  our  other  forms  of  communication,  particularly  for  discussing  problems 
with  the  resource  collection  process. 


7.  RAPID  LINGUISTIC  RESOURCE  CREATION 


Not  all  required  resources  were  immediately  available  on  the  web.  For  both  the  dry-run 
and  the  full  exercise,  after  existing  stores  of  data  for  the  target  language  had  been 
identified and harvested, human annotators worked to create topic-relevance judgments,
manual  summaries,  entity-tagged  texts,  aligned  parallel  text,  and  a  host  of  other 
resources.  Moreover,  human  annotators  were  needed  to  create  answer  keys  for  the 
benchmark  test  data  used  in  evaluating  the  surprise  language  technology. 

For  the  Cebuano  dry-run,  LDC  resource  creation  focused  on  general  resources: 
sentence-aligned  bilingual  text,  entity-tagged  data,  and  morphological  parsers.  During 
the month-long Hindi exercise, LDC worked with other sites to produce not only general
resources  but  also  substantial  quantities  of  annotated  and  unannotated  training  and  test 
data  to  support  the  full  range  of  TIDES  technologies:  information  extraction,  detection, 
summarization,  and  machine  translation.  LDC  also  defined  the  training  and  evaluation 
corpora  for  each  task  and  worked  with  the  US  National  Institute  of  Standards  and 
Technology  (NIST)  to  distribute  this  data  to  participating  sites,  enabling  NIST  to  then 
evaluate  system  performance  against  stable  ground-truth  data  labeled  by  human  judges. 

In  preparation  for  the  surprise  language  experiments,  LDC  had  created  basic 
annotation  tools  and  streamlined  annotation  guidelines  that  would  allow  annotators  to 
make  rapid  progress  on  each  task  with  minimal  training.  Platform-independent, 
multilingual  annotation  tools  were  developed  for  the  exercise,  primarily  utilizing  the 
Annotation  Graph  Toolkit  or  AGTK  [Bird  and  Liberman  2001].  The  tools  take 
tokenized,  UTF-8  encoded  text  as  input  and  save  annotation  records  as  stand-off  markup. 
This  approach  also  allowed  annotation  work  to  be  distributed  across  multiple  sites. 
Particularly in the case of Hindi, both the pre-existing annotation tools and the process
for creating manually tagged data had to be substantially revised to handle the encoding
issues described above. This reduced the amount of time ultimately available for creation
of annotated data, which is reflected in both the quality and the overall quantity of the
resources created for Hindi.
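The stand-off approach described above can be illustrated with a minimal sketch. It assumes a simple JSON record layout with token offsets; the `Annotation` class, its field names, and the example sentence are hypothetical and do not reproduce the actual AGTK file format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Annotation:
    """One stand-off record (hypothetical layout, not the AGTK format)."""
    start: int   # index of the first token in the span
    end: int     # index one past the last token in the span
    label: str   # e.g. an entity type


def apply_standoff(tokens, annotations):
    """Resolve stand-off records against the untouched token list."""
    return [(a.label, " ".join(tokens[a.start:a.end])) for a in annotations]


# The source text is tokenized once and never modified.
tokens = "The Linguistic Data Consortium is in Philadelphia .".split()

# Annotations live in a separate record that only points into the tokens.
annotations = [
    Annotation(1, 4, "ORGANIZATION"),
    Annotation(6, 7, "LOCATION"),
]

# Serialize the records on their own (ensure_ascii=False keeps UTF-8 text readable).
records = json.dumps([asdict(a) for a in annotations], ensure_ascii=False)

# Another site can reload the records and resolve them against the same tokens.
restored = [Annotation(**r) for r in json.loads(records)]
resolved = apply_standoff(tokens, restored)
```

Because the records only reference token positions, several annotators or sites can work on the same source file without altering it, which is the property that made distributed annotation practical.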

News  texts  labeled  for  topic  relevance  are  an  important  resource  for  information 
retrieval  and  related  technologies.  During  the  Hindi  exercise,  LDC  annotators  engaged  in 
topic  development  and  relevance  assessment  to  support  cross-language  information 
retrieval  (CLIR)  and  topic  detection  and  tracking  (TDT).  For  both  evaluation  areas,  LDC 
defined  a  training  corpus  consisting  of  news  texts  drawn  from  the  Hindi  found  resources. 
Native  Hindi-speaking  annotators  then  scanned  the  corpus  and  selected  15  broad  (theme- 
based)  topics  and  15  narrow  (event-based)  topics.  Annotators  created  profiles  for  each  of 
the  resulting  topics,  consisting  of  a  title,  definition,  and  narrative  plus  a  set  of  query 
terms. Each topic profile was also translated into English from the original Hindi.

Research  sites  participating  in  CLIR  were  given  topic  profiles  for  the  15  broad  topics 
to  use  as  training  data.  Sites  used  these  training  topics  to  index  the  news  corpus  for  topic 
relevance.  Human  annotators  then  read  and  labeled  the  news  stories  in  the  resulting 
relevance-ranked  lists  in  order  to  establish  ground-truth  for  each  topic.  During  the  Hindi 
exercise,  LDC  annotators  labeled  a  total  of  1710  documents  for  CLIR  topic  relevance. 

For  the  TDT  evaluation,  sites  were  provided  not  with  the  topic  profiles,  but  with  four 
on-topic  training  documents  for  each  of  the  15  event-based  topics.  Systems  were  then 
required to detect all other on-topic documents in the Hindi corpus. In addition, 11 of the
15  topics  were  selected  for  cross-language  detection  in  English.  LDC  annotators  worked 
with  annotators  at  the  University  of  Massachusetts  Amherst  to  complete  topic 
development  and  relevance  assessment  of  the  sites'  submissions. 

Topic-relevance annotation was completed using LDC's existing topic-tagging toolkit
developed  previously  for  TDT,  TREC,  and  related  projects,  and  customized  for  the 
surprise  language  exercise  to  handle  Hindi  data. 


Although  web  pages  can  frequently  be  mined  for  lists  of  certain  kinds  of  entities 
(names  of  government  officials  and  place  names,  for  example),  texts  in  which  named 
entities  have  been  tagged  are  virtually  unobtainable  on  the  web.  This  kind  of  training 
data  is  required  for  information-extraction  technology  development,  thus  necessitating 
manually  annotated  training  data. 

The  named  entity  task  for  surprise  language  utilized  a  subset  of  the  Message 
Understanding Conference (MUC) named entity annotation guidelines [Chinchor 1997],
excluding  temporal  expressions  and  number  expressions  from  annotation.  Annotators 
focused  instead  on  three  named  entity  types:  organizations,  consisting  of  named 
corporate,  governmental,  or  other  organizational  entities;  persons,  consisting  of  named 
persons  or  families;  and  locations,  which  can  be  politically  or  geographically  defined 
(e.g.,  cities,  countries,  mountain  ranges,  bodies  of  water).  During  the  surprise  language 
exercise,  annotators  at  BBN,  NYU,  and  LDC  created  over  430,000  words  of  named  entity 
training  data,  including  some  data  that  was  annotated  twice  to  establish  interannotator 
agreement  rates.  Annotation  was  performed  using  the  AGTK  named  entity  tagging  tool 
that  LDC  had  developed  for  the  exercise.  The  tool  allowed  annotators  to  swipe  over  a 
region of text, and then use the mouse to select the appropriate entity type from a
pull-down menu. Annotated text is displayed with color-coded underlining.


Fig.  3.  Simple  named  entity  annotation  tool. 


In  addition  to  the  training  data  described  above,  LDC  also  defined  an  evaluation 
corpus  of  25  Hindi  news  documents.  These  documents  were  subject  to  additional 
processing  and  manual  validation  to  ensure  appropriate  news  content  and  consistent 
encoding. Despite this extra attention, some minor encoding anomalies remained; this
was  an  unfortunate  side  effect  of  the  fact  that  all  processing  and  manual  validation  of  the 
test  data  had  to  be  completed  in  just  a  few  hours  at  the  very  end  of  the  exercise  through 
intensive  collaboration  between  non-Hindi  programmers  and  non-programmer  Hindi 
speakers. 

The  resulting  Hindi  test  corpus  was  annotated  in  multiple  ways  to  provide  ground- 
truth  data  for  several  evaluations.  Annotators  tagged  the  data  for  named  entities  for  the 
extraction  evaluation,  and  four  independent  annotators  each  created  10-word  summaries 
in  English  for  each  of  the  test  documents  to  support  the  summarization  evaluation. 


Bilingual  texts  are  another  critical  resource.  For  purposes  of  machine  translation 
(MT),  the  best  bilingual  training  text  belongs  to  the  same  genre  as  the  text  that  the  MT 
programs  are  expected  to  translate — in  the  case  of  the  surprise  language  exercise,  news 
text.  We  found  some  bilingual  news  text  for  Cebuano,  but  not  as  much  as  we  needed.  The 
only  way  to  obtain  the  needed  bilingual  text  was  therefore  to  create  it,  and  for  this 
purpose  we  hired  a  number  of  translation  agencies.  At  an  average  price  of  28  cents  per 
word,  this  is  not  an  inexpensive  operation,  but  neither  is  it  impossible.  We  also  used 
these agencies to create manual translations of the Hindi test corpus in order to provide
ground-truth data for the Hindi MT evaluation.

In  addition  to  simply  having  bilingual  text,  we  wanted  to  align  those  texts  at  the 
sentence level. The Bible is available in Cebuano and in Hindi, and in effect constitutes
parallel text aligned at the verse level. However, the style and vocabulary of Bible
translation are different enough from news text that it is desirable to align bilingual news
text  as  well.  Because  of  the  way  we  prepared  the  text  to  be  sent  to  translation  agencies,  it 
came  back  already  aligned. 

But  we  had  other  bilingual  texts  that  we  found  on  the  web,  and  annotators  aligned 
these  using  a  manual  alignment  tool  we  had  built  using  AGTK.  The  two-panel  annotation 
tool  displays  the  original  document  and  its  translation  side-by-side;  the  tool  then 
automatically  selects  the  first  sentence  (defined  by  its  punctuation)  in  the  source  data, 
along  with  the  first  sentence  in  the  translation.  If  the  annotator  judges  these  sentences  to 
be  a  translation  pair,  s/he  hits  a  button  to  record  that  judgment.  The  annotation  is  then 
stored  as  a  record  in  a  separate  file,  which  indexes  each  translation  pair  in  terms  of  token 
offsets.  The  tool  then  automatically  selects  the  next  sentence  in  both  the  source  document 
and  the  translation,  and  the  annotator  makes  another  judgment.  If  the  two  sentences 
selected by default are not translation pairs, the tool allows the annotator to add or delete
words from the selection in either language, or to jump to another complete sentence
selection with ease.
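The default selection and the stored alignment record described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's implementation: the `AlignedPair` record, the punctuation set, and the placeholder tokens are all hypothetical, and the real tool lets the annotator confirm or adjust each proposed pair.

```python
from dataclasses import dataclass

# Assumed sentence-final punctuation; a real script may need other marks.
SENTENCE_END = {".", "?", "!"}


@dataclass
class AlignedPair:
    """One translation pair, indexed by token offsets (hypothetical layout)."""
    src_start: int
    src_end: int   # one past the last source token
    tgt_start: int
    tgt_end: int   # one past the last target token


def sentence_spans(tokens):
    """Split a token list into (start, end) spans at sentence-final punctuation."""
    spans, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in SENTENCE_END:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(tokens):          # trailing tokens without final punctuation
        spans.append((start, len(tokens)))
    return spans


def propose_pairs(src_tokens, tgt_tokens):
    """Pair the n-th source sentence with the n-th target sentence by default."""
    return [
        AlignedPair(s0, s1, t0, t1)
        for (s0, s1), (t0, t1) in zip(sentence_spans(src_tokens),
                                      sentence_spans(tgt_tokens))
    ]


# Placeholder tokens stand in for a source document and its translation.
src_tokens = "s1 s2 . s3 s4 ?".split()
tgt_tokens = "t1 . t2 t3 t4 ?".split()
pairs = propose_pairs(src_tokens, tgt_tokens)
```

Each `AlignedPair` is a proposal only; in the workflow described here the annotator either accepts it or widens, narrows, or skips the selection before the record is written to the separate alignment file.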


Fig.  4. Text  alignment  annotation  tool. 


Morphological  parsers  can  be  found  on  the  web  for  some  languages,  and  indeed  one 
was  located  for  the  Hindi  exercise,  but  we  did  not  find  one  for  Cebuano.  However, 
Cebuano  inflectional  morphology  is  fairly  simple.  We  used  a  grammatical  description  of 
Cebuano  [Bunye  1971],  together  with  a  merged  version  of  the  Cebuano  lexicons  we  had 
found,  to  build  a  morphological  transducer  running  under  the  Xerox  program  xfst 
[Beesley  and  Karttunen  2003].  Writing  the  grammatical  rules  for  the  parser  took  just  a 
few  hours.  Testing  on  news  texts  revealed  a  parse  rate  of  around  60%.  Many  of  the 
“failures”  turned  out  to  be  English  loan  words  that  were  not  listed  in  the  Cebuano 
lexicons, or punctuation or number tokens. Eliminating these gave a parse rate above 90%.
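The two parse rates differ only in which tokens are counted. The sketch below shows one way such a rate might be computed, first over all tokens and then after excluding punctuation, numbers, and known English loans. The lexicon, loan list, and token list are toy placeholders (not Cebuano data), and a set lookup stands in for the xfst transducer.

```python
import re

# Tokens consisting only of punctuation and/or digits.
NON_WORD = re.compile(r"^[\W\d_]+$")


def parse_rate(tokens, can_parse, skip=lambda tok: False):
    """Fraction of scored tokens that the parser accepts."""
    scored = [t for t in tokens if not skip(t)]
    if not scored:
        return 0.0
    return sum(1 for t in scored if can_parse(t)) / len(scored)


# Toy stand-ins: a set lookup replaces the real morphological parser.
parsable = {"w1", "w2", "w3", "w4", "w5", "w6"}
english_loans = {"computer"}
tokens = ["w1", "w2", "computer", ",", "w3", "2003", "w4", "w5", "w6", "zzz"]


def can_parse(tok):
    return tok in parsable


def skip_non_target(tok):
    """Exclude punctuation, numbers, and listed English loans from scoring."""
    return bool(NON_WORD.match(tok)) or tok in english_loans


raw_rate = parse_rate(tokens, can_parse)
filtered_rate = parse_rate(tokens, can_parse, skip=skip_non_target)
```

With these toy tokens the raw rate is 6 of 10 and the filtered rate is 6 of 7; the figures are illustrative only and are not the measurements reported above.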


Some of the remaining unparsed words in our Cebuano news texts were Spanish
loans. Simply adding a Spanish parser would not work, however, because these loans
(unlike most of the English loans) are spelled according to Cebuano orthography. Most of
the differences are mechanical (e.g., Cebuano “k” comes from Spanish “c” or “qu”). It
would not be difficult to construct a transducer to convert between the two orthographies,
and then use a Spanish transducer to gloss the Spanish loans with English. But we have
not attempted this step.
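As a rough illustration of how mechanical such a conversion could be, the sketch below applies the one correspondence mentioned above ("c" or "qu" to "k") as ordered rewrite rules. It is not the transducer discussed in the text, which was never built. The restriction that "c" is left alone before "e", "i", or "h" is an added assumption, and the outputs are not claimed to be attested Cebuano spellings.

```python
import re

# Ordered rewrite rules, Spanish spelling -> Cebuano-style spelling.
# Only the "c"/"qu" -> "k" correspondence comes from the text; the lookahead
# that leaves "c" alone before e, i, or h is an assumption of this sketch.
RULES = [
    (re.compile(r"qu"), "k"),
    (re.compile(r"c(?![ehi])"), "k"),
]


def respell(word):
    """Apply each rewrite rule in order to a lower-cased word."""
    out = word.lower()
    for pattern, replacement in RULES:
        out = pattern.sub(replacement, out)
    return out


# Spanish words used purely to exercise the rules.
examples = {w: respell(w) for w in ["queso", "casa", "cine"]}
```

The sketch runs in one direction only. Going from Cebuano spelling back to Spanish is ambiguous, since "k" may correspond to either "c" or "qu", so a reverse converter would need to generate candidates and check them against a Spanish lexicon.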

While (near-)native speaker annotators were preferred for all surprise language
annotation tasks, it may be extremely difficult to locate and hire qualified staff for some
languages; even when skilled staff can be identified, they may be unable to devote time to
the project or may be prohibited from working due to visa restrictions. In order to
minimize the need for native speaker annotators and to reduce the amount of time
required by each annotator to produce high-quality resources, LDC developed annotation
practices and tools to make the process maximally efficient. In some cases, non-native
speakers can perform a substantial amount of initial annotation work, and this work can
be checked over by native speakers. If the native orthography of the language is familiar
or a standard Romanization exists, then English-speaking annotators can achieve high
accuracy on both the parallel text alignment task and some parts of the named entity task.
For parallel alignment, punctuation cues, cognates, names, and numbers provide cues to
sentence pairs in each language.

During the Cebuano exercise this approach was used quite successfully.
English-speaking annotators manually aligned a subset of the parallel text data; when a native
Cebuano speaker checked over the data, very few realignments were necessary, and most
of those were not corrections but rather splitting larger chunks of text into finer alignment
pairs. For the named entity task, personal names and some organization and location
names may be represented identically in both English and the target language, allowing
non-native speakers to perform a large portion of the tagging. Because the native
orthography for Hindi does not use Latin characters, however, this method could not be
exploited during the full surprise language exercise; hence a large team of native Hindi
speakers had to be employed to create the range of manually tagged resources required
for the experiment.

In general, annotation projects demand detailed, well-tested guidelines and
customized annotation tools, careful hiring decisions followed by long periods of
annotator training, and extensive quality assurance processes, all of which require
substantial time and effort to implement. These processes had to be adjusted significantly
to meet the special demands of the surprise language exercise. While typical annotation
guidelines are quite extensive and aim to provide an exhaustive set of rules for handling
rare or exceptional cases (as well as covering typical cases), the guidelines developed for
the surprise language were more limited and focused only on the most common types of
constructions. Team leaders had a deeper understanding of the full guidelines, and when
annotators encountered a construction they did not know how to classify, the team leader
could provide detailed instruction. This allowed the annotators to focus on creating data
rather than learning guidelines that may never be applied to the current task. This
approach was essential given both the time constraints imposed by the exercise and the
potentially limited pool of native speakers, let alone native-speaking linguists or language
experts, available to act as annotators for a given language.

Instead of highly customized annotation tools, for the surprise language exercise, we
developed a basic suite of multilingual, platform-independent tools that could be
reconfigured to meet the demands of a particular language or task. Modifications were
also made to LDC's staffing procedures so that native speakers could be identified,
interviewed, hired, and trained as annotators within hours or days rather than weeks.
Such changes were necessary to allow for rapid resource creation under the surprise
language context.

However,  these  divergences  from  LDC's  normal  practices  came  at  a  cost.  Without 
knowing  in  advance  what  the  "surprise"  language  would  be,  annotation  guidelines  were 
necessarily general (and by default, English-centric). For instance, some necessary
changes  to  the  Hindi  named  entity  guidelines  became  apparent  only  in  the  final  week  of 
the  exercise;  only  then  had  annotators  seen  enough  data  and  learned  enough  about  the 
task  to  fully  understand  why  some  general-purpose  rules  were  not  well  suited  to  Hindi. 
Also,  with  quick  hiring  and  limited  training,  annotation  quality  suffered.  Regular  quality 
assurance  measures  like  second  passing,  dual  annotation,  and  discrepancy  resolution  had 
to  be  skipped  in  order  to  meet  the  aggressive  surprise  language  deadlines.  In  some  cases, 
quick  updates  to  the  annotation  tools  to  make  them  display  Hindi  text  properly  introduced 
new  bugs  that  had  to  be  fixed  before  annotation  could  proceed,  resulting  in  frustrated 
annotators,  panicked  managers,  and  anxious  researchers  (not  to  mention  fed-up 
programmers!).  Ultimately,  it  proved  possible  to  collect  or  create  the  linguistic  resources 
needed  to  enable  technology  development  and  evaluation  in  the  context  of  the  surprise 
language  exercise,  but  not  without  impacting  both  resource  quantity  and  quality. 

As  in  the  case  of  the  language  survey,  we  believe  that  the  annotation  tools  developed 
for  the  surprise  language  exercise  will  be  of  interest  to  those  who  wish  to  create 
comparable  resources  for  other  languages.  While  the  tools  are  currently  optimized  to 
work  within  LDC's  local  operating  environment  and  within  the  surprise  language  context, 
we  intend  to  make  the  toolkit  freely  available,  along  with  documentation  and  perhaps 
training  courses,  once  we  have  completed  further  modifications  and  testing. 


8.  SUMMARY  OF  RESULTS 


The Cebuano dry-run and the Hindi full exercise targeted two very different languages
from the standpoint of resource availability. We summarize here some of the differences.

•  Cebuano  was  (relatively)  a  resource-scarce  language,  whereas  abundant 
resources  are  available  for  Hindi — more,  in  fact,  than  we  could  actually  process 
in  a  short  time.  Some  of  the  resources  found  for  each  are  summarized  in  Table  I 
(this  does  not  include  the  encoding  converters  which  were  found  or  built  for 
Hindi, nor text tagged for certain other purposes, e.g., time phrases or parts of
speech). 

• Cebuano was written in a Roman writing system, whereas Hindi is written in a
non-Roman system. One implication of this is that it was much easier for
non-native speakers to work with and even annotate Cebuano text than Hindi text.

•  Cebuano  had  a  single  (ASCII)  encoding,  whereas  Hindi  text  appears  on  the  web 
in  numerous  encodings,  forcing  us  to  spend  much  of  our  time  developing 
encoding  converters  to  transliterate  Hindi  texts  into  a  standard  encoding. 


Resource                                      Cebuano   Hindi
Text (words)                                  250K      >100M
Bilingual text (words)                        130K      >5M
Lexicons (headwords)                          25K       30K
Text annotated for named entities (words)     10K       430K
Text tagged for topic detection (documents)   None      >2200
Texts with summaries (documents)              None      25
Morphological parsers or stemmers             1         4

Table I. Major Resources for Cebuano and Hindi


We  also  summarize  some  important  factors  that  affected  our  work  in  more  or  less  the 
same  way  for  Cebuano  and  Hindi: 

• We did not find sufficient bilingual news text for our needs in either language;
we were therefore forced to create translations using translation agencies or
other means.

• While we found lists of entities in both languages, it was still necessary to do
manual annotation of named entities in texts.

• Manual annotation was also required to provide training data for a range of
technologies.

The  fact  that  the  two  exercises  targeted  quite  different  languages  for  purposes  of  NLP 
gives  us  some  confidence  that  we  can  base  our  future  resource  collection  and  creation 
efforts  on  these  experiences. 

We  should  however  note  that  we  did  not  face  a  major  problem  from  morphology  for 
either  language;  we  had  a  morphological  parser  for  Cebuano  and  we  found  one  for  Hindi 
(and one site developed a simple stemmer). Had we faced a language with a more
complex morphology, and been unable to find an existing morphological parser, we
would  have  had  to  expend  considerably  more  effort  in  building  a  parser  (or  stemmer). 
While  some  effort  has  gone  into  machine-learning  approaches  to  morphology  [Maxwell 
2002],  the  state  of  the  art  is  not  up  to  the  automatic  creation  of  morphological  parsers  or 
stemmers  for  languages  with  any  degree  of  complexity  in  their  morphology. 

9.  FUTURE:  A  CALL  FOR  COLLABORATION 

Research  on  the  rapid  porting  of  linguistic  technologies  to  new  languages  is  crucial,  as  it 
helps  determine  the  most  efficient  porting  methods  and  encourages  cost-benefit  analyses 
of  the  types  and  sizes  of  linguistic  resources  necessary.  However,  it  will  always  be 
preferable  to  avoid  the  scramble  inherent  in  rapid  porting  by  preparing  and  providing 
core  linguistic  resources  in  advance  of  need.  Therefore  we  propose  an  initiative  to  begin 
collecting  the  resources  necessary  to  develop  critical  language  technologies  in  all  target 
languages. 

Necessary  resources,  varying  both  with  the  technology  and  the  target  language,  are 
open to negotiation. However, we propose that a core include significant bodies
(minimally 100,000 words) of electronic text and parallel text, medium-sized translation
lexicons  (10,000  words),  and  entity  databases  and  texts  tagged  for  entities  and  topics. 
Such  resources  support  information  access  technologies  that  work  with  text  and  are 
simple  enough  that  it  should  be  possible  to  locate  them  rapidly  for  a  large  number  of 
languages.  Although  there  are  numerous  other  desirable  resources,  we  propose  that  this 
initiative  begin  with  attainable  goals  in  order  to  maximize  the  probability  of  early 
success.  The  choice  of  target  languages  is  similarly  open  to  negotiation,  but  we  propose 
that  the  work  continue  targeting  larger  languages  in  priority  order. 

Nevertheless,  it  is  clear  that  LDC  cannot  tackle  this  task  of  language  documentation 
for  any  large  number  of  languages,  even  with  the  help  of  the  other  sites  involved  in  the 
TIDES  surprise  language  exercise.  Accordingly,  we  invite  a  global  participation  in  this 
effort.  Participants  would  define  their  local  priorities  in  collaboration  with  other 
interested  groups  working  in  the  same  or  related  languages.  We  are  encouraged  by  recent 
efforts  of  European  initiatives  like  ELSNET  and  ENABLER  in  this  regard,  and  hope  to 
work  with  these  groups  to  develop  approaches  and  resources  that  are  both  complementary 
and  compatible. 

Whatever  the  approach  adopted  for  the  documentation,  collection,  development,  and 
distribution  of  resources  for  any  particular  language,  we  propose  four  principles  to 
coordinate  the  effort: 

(1)  Individual  participants  will  conduct  language  resource  surveys  and  will  identify, 
collect,  and  further  develop  linguistic  resources,  making  the  results  available  to  the 
whole  group.  Access  to  the  group  survey  results  would  be  contingent  upon 
substantive  contribution  to  the  effort. 

(2)  Although  many  of  the  targeted  resources  are  already  available  on  the  Internet  for 
research  purposes,  world-wide  resource  providers  will  need  to  engage  data  creators  in 
intellectual property negotiations in order to secure distribution rights and then
distribute resources through existing channels. This will add value to the raw
resources  by  creating  corpora  that  are  stable,  consistently  structured,  and  capable  of 
being  used  for  a  variety  of  purposes. 

(3)  Participants  are  encouraged  to  use  standard,  freely  available  tools  (such  as  the 
annotation  tools  we  describe  above)  in  order  to  encourage  ongoing  resource  creation 
in  a  framework  that  promotes  easy  exploitation  of  its  results  by  the  widest  possible 
audience. 

(4) The resources created through this process should be made available to the
world-wide community of researchers using an archival distribution method, and indexed so
that  other  researchers  can  find  the  resources  and  make  use  of  them,  while  respecting 
intellectual  property  rights  [Bird  and  Simons  2003]. 

Our experience with the many-language resource survey, the rapid collection,
development, and dissemination of linguistic resources, and the highly collaborative
framework of the surprise language exercise leads us to believe that a broader, more
ambitious effort is not only possible but obligatory, given the current state of language
technologies and the focus of technology programs world-wide.